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This MATLAB program calculates the dynamics of the reduced density matrix of an open quan- 
tum system modeled by the Feynman- Vernon model. The user gives the program a vector describing 
the coordinate of an open quantum system, a hamiltonian matrix describing its energy, and a spec- 
tral distribution function and temperature describing the environment's influence on it, in addition 
to the open quantum system's intial density matrix and a grid of times. With this, the program 
returns the reduced density matrix of the open quantum system at all (or some) moments specified 
by that grid of times. This overall calculation can be divided into two stages: the setup of the 
Feynman integral, and the actual calculation of the Feynman integral for time-propagation of the 
density matrix. When this program calculates this propagation on a multi-core CPU, it is this 
propagation that is usually the rate limiting step of the calculation, but when it is calculated on 
a GPU, the propagation is calculated so quickly that the setup of the Feynman integal actually 
becomes the rate limiting step for most cases tested so far. The overhead of transfrring information 
from the CPU to the GPU and back seems to have negligible effect on the overall runtime of the 
program. When the required information cannot fit on the GPU, the user can choose to run the 
entire program on a CPU. 
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INTRODUCTION 



One is very often interested in how the reduced density matrix p of an OQS (open quantum system) changes with respect to time 
{pD Ereduced density matrix dynamics). The most popular open quantum system model to date is the Feynman- Vernon model (see 
section II,), and for this model, a formally exact mathematical description of ihepD was deveoped by Richard P. Feynman and two 
of his PhD students Frank L. Vernon Jr. and Willard H. Wells in the late 1950s H^l- In this mathematical formalism, the pD is 
represented in terms of Feynman integrals, for which no general closed form analytic solution is known. Therefore, to determine the 
numerical values of the elements of the reduced density matrix at a given point in time, one needs to evaluate these Feynman integrals 
numerically. Early work in numerically calculating these Feynman integrals used Monte Carlo methods, but since the integrnds are 
highly oscillatory, deterministic algorithms have been the most popular methods for these numercal Feynman integral calculations since 
breakthroughs in algorithmic development were made , mainly between 1992 and 2001. 

The quasi-adiabatic propagator FeynmarQ integral (QUAPI) technique indroduced by Nancy Makri in 19920 helps the numerical 
calculation of these Feynman integrals converge with a larger sized time step used in the discretization of the Feynman integrals, when 
the bath is nearly adiabatic. Today's most popular algorithm for calculating these numerical Feynman integrals (whether using a 
quasi-adiabatic propagotor or not) uses a representation of the numerical Feynman integrals in terms of matrix-vector-like operations, 
and is called the 'tensor propagator scheme'. The most evolved form of the tensor propagator scheme was explained by Nancy Makri 
and Dmitrii E. Makarov in 1995 [12], although the original formulation of it was introduced by them in 1994[8j and explained in more 
detail in their 1995 paper 11 1. The filtered propagator functional (FPF) introduced by Eunji Sim and Nancy Makri in 1996 |l9l? ] is an 
improvement of the tensor propagator scheme which saves computer memory by using Monte Carlo importance sampling to filter which 
Feynman paths are included in the Feynman integral, but Eunji Sim's 2001 algorithm [18] is an alternative way of filtering the Feynman 
paths within the tensor propagator scheme which does not require Monte Carlo importance sampling and has been shown to save even 
more memory than the FPF technique, without sacrificing accuracy. I am not aware of any further algorithmic developments for the 
calculations of numerical Feynman integrals for the pD of the Feynman- Vernon model in the first decade of the 21st century. Very 
recently another technique was introduced which can significantly reduce the number of Feynman paths required in the calculation of 
these Feynman integrals without significantly deteriorating the accuracy, which is particularly useful when the reduced density matrix 
changes so quickly that resolving its dynamics requires it to be calculated much more often than the response function of the OQS's 
environment needs to be sampled for a chosen amount of accuracy [l|. 

In the MATLAB program described in this paper, the main calculation which propagates the reduced density matrix in time using 
Feynman integration, is mainly carried out by the MATAB function BSXFUN (binary singleton expansion function), which was first 
introduced in March 2007 in version R2007a of MATLAB. This algorithm which uses BSXFUN is the fastest MATLAB implementation 
for the tensor propagator scheme metioned above, with which I have experimented. It is also extremely well suited for performance on 
GPUs (graphical processing units). 

The company AccelerEyes developed the commercial software Jacket (for which the 1st beta version was vO.2 which was released in 
June 2008), which allows MATLAB functions to be performed with significant speedup on GPUs. BSXFUN was first incorporated into 
Jacket in vl.2, which was released in October 2009. In April 2010 a viral blog entry was published!? ], entitled 'Crushing MATLAB 
Loop Runtimes with BSXFUN', which demonstrated that for the example calculation in that study, BSXFUN performed about 45 
times faster than the naive MATLAB algorithm, when both calculations were executed on the same CPU. Even more impressively, 
when this BSXFUN calculaton was performed on their GPU, the calculation was sped up by almost another 6 times! 

MATLAB itself did not have a built-in capability to execute caculations on GPUs until September 2010 in version R2010b, and it 
did not support the use of the built-in BSXFUN function on GPUs until very recently (March 2012 in version R2012a). 



SETTING 

The most popular OQS model is currently the Feynman- Vernon model. In the Feynman- Vernon model, the OQS (whose coordinate 
is denoted by s) is coupled linearly to a set of QHOs (quantum harmonic oscillators) Qt'- 

H = HoQS + -ffoQS-bath + -f^bath (1) 
= ^OQS + E CftSQ + E (s^kQk^ + hmni^lQl) ■ (2) 

K K 

In most models the QHOs span a continuous spectrum of frequencies and the strength of the coupling between the QHO of frequency 
uj and the OQS is given by the spectral distribution function J{u!): 

^H = 5E^^(^-^«)- (3) 

2 V "t-kWk 



^ The term 'path integral' is used more commonly than 'Feynamn integral' here, but this term is ambiguous. Currently, the first result on the search engine 
at www.google.com, when the search query 'path integral' is entered, is a Wikipedia page that currently links to three different meanings of the word 'path 
integral': (1) line integral, (2) functional integration, and (3) path integral formulation. Only the third of these is unambiguously the Feynman integral 
discussed in this paper. The 'line integral' is an integral over a path, rather than over a set of paths; and the term 'functional integral' can refer to at least 
three types of functional integrals: (1) the Wiener integral, (2) the Levy integral, and (3) the Feynman integral. 
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For the hamiltonian of the Fey iiman- Vernon model, the bath response function a{t) is the following integral transform of J{io), in 
terms of the inverse temperature /3 = : 

a{t) = — J J(cL;)^coth^^^j cos(wi) - isin(tjt) jdw (4) 

= - / j^^e-'-'doj, J{-oj) E J(^) (5) 

2sinhl^J 

TT J-oo 1 - exp(-pcLin.) 

where, equation |6] can be written in terms of the Bose- Einstein distribution function with x = f3ujh: 

j'Bosc-Einstcin _ ^ 

1 - exp(-x) 

Assuming that the density matrix of the OQS plu s its environment at i = is ptotai = p(0) ® g^/^^fbati,^ ^i^^ reduced density matrix 
elements at time t = N • At are given exactly by[10!| (the integrals are used if the coordinate s of the OQS is continuous, and the 
summations are used if it is discrete or discretized) : 

^ /n-1 \ N-l 

where the discretized influence functional can be written as: 



Afc„ 



/ Afc„„x k \ 

/ = exp - £ E(4-'Sfc)(??fcfc'4'-'7fcfc'Sfc') > 



A;=0 fc'=0 



and the 77 coefficients are given in terms of the bath response function a(t) by 0: 

-(fe+l)At ^(fc'+l)At 



/ / a{t' - t")dt"dt' ,k+k', (10) 

JkAt Jk'At 
/~(fc+l)At ft' 

/ / a{t' -t")dt"dt' ,ki{0,N} , (11) 

JkAt JkAt 
f-NAt r^*/2 

/ / a{t' - t")dt"dt' , (12) 

JAfAt-At/a Jo 

f' f\{t' -t")dt"dt\ (13) 
Jo Jo 

r-NAt ft' 

\ / - t")dt"dt' , (14) 

JATAt-At/a jAfAt-At/2 
/~(fc+l)At /-At/2 

/ / a{t' - t")dt"dt' , (15) 

JfcAi Jo 
fNAt f{k+l)At 

/ / a{t' -t")dt"dt' , (16) 

JNAt-^t/2 JkAt 

and can very often be represented by closed form analytic expressions [H The purpose of the program is to calculate all (or some) 
of the matrix elements {s'^\p{t)\s]^) at a time t (or at all of the times {tk = k ■ Aij^p), given the system coordinate s, its hamiltonian 
Hqqs and initial density matrix p(0), the spectral distrution function J{lli) or the bath response function a{t), the temperature T, 
and the convergence parameters {At, A/cmax}- 

To save time in recalucalting certain quantities involved in the overall calculation, the program stores four variables when N > Afci„ax) 
each containing 



Vkk' 
Vkk 

VNO 
VOQ 

riNN 

VkO 
VNk 



^ which is actually a function 
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(double precision) floating point complex numbers (here M is the dimension of the Hilbert space of the OQS). In my experience, using 
single precision numbers has sometimes given values for p(t) that are very slightly (but observably) different from when double precision 
numbers are used, and since the purpose of the numerical Feynman integral technique described here is to calculate exact values of 
p{t) (usually for benchmarking less accurate, less computationally expensive methods), double precision is recommended (although the 
program will still work if one desires to keep less digits to save memory). When these complex numbers are double precision, they take 
up 16 bytes of the computer's memory, so the overall amount of memory required to store the four variables mentioned just before 
equation [T7] is 



PMC = 4 • 16 • Af 2(Afc.„.„+i) (^8) 
= 64-M2(^'^-— +1) (19) 

bytes of information overall (here PMC stands for primay memory cost, since the storage of these numbers almost always constitutes 
the vast majority of the memory required for the overall calculation, and the PMC is the most important computational complexity 
measure associated with the program described in this paper). This memory cost could be reduced if one of the filtering techniques 
mentioned in the introduction section of this paper were to be implemented; however the current version of the program does not have 
either of these algorithms implemented. 



MODEL SYSTEM FOR EXAMPLE CALCULATIONS 

All example calculations in this paper will be for a model system with: 



S6{0,1}, (20) 




J{uj) = Auj^e^-"/''''^' ■ (22) 



This OQS model has been used to describe many experiments (some examples are [16|) on laser-driven single quantum dots after a 
rotating wave approximation (RWA). The specific parameters used in this paper will consistently be: 



n{t) = -/8 , 

A = 7r0.027ps^ , 
LOc = 2.2ps"^ , and 
r= 25K . 



(23) 
(24) 



The values of A and ujc come from the experimental study of this model system described in [16j, and have been used in many 
computational studies that use this same model system (some examples are [l, 1^ The values of i7 and T were chosen in order 

for the damped Rabi oscillations to last for a long time (for better evaluation of the performance of the program's propagation of p 
with respect to time), while being damped enough to be physically interesting, and while keeping fl and T physically realizable. 



PERFORMANCE 



All calculations reported in this paper were performed on an Intel Xeon/Nehalem CPU with 2.8GHz clock rate, and/or on an Nvidia 
C2020 GPU which are part of the SKYNET Viglen/NVIDIA hybrid GPGPU Cluster provided by the Oxford Supercomputing Centre 
at the Oxford e-Research Centre at University of Oxford. This this cluster is continually in use by many people, making its performance 
variant with respect to time. This may be reflected in some of the performance times reported below, but it is still valuable to see the 
overall trends and comparisons. 
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Table I. Running the main propagation calculations {k > Afc)) on the CPU vs the GPU for varying values of Afcmax- The pimary memory cost 
(PMC) associated with each value of Afcmax is given by equation 1191 





PMC 


Run titnp on (^PTT (QppnnH'^l 


Rim i"iTTip nn (T^PTT ic;ppnnr]cil 


A nnrn"viTnatp ^t^ppHiiti nn C-IPTT 


2 


4.096 KB 


0.071580 


1.224592 




3 


16.38 KB 


0.081609 


1.226297 




4 


65.54 KB 


0.107413 


1.230001 




5 


262.1 KB 


0.209927 


1.22675 




6 


1.049 MB 


0.641943 


2.177962 




7 


A 'i r\ A TV TT^ 

4.194 MB 


2.571893 


1.286302 


2x 


8 


16.78 MB 


16.347692 


1.766242 


9x 


9 


67.11 MB 


59.452344 


4.673941 


13x 


10 


268.4 MB 


291.731064 


15.580798 


19x 


11 


1.074 GB 


1198.620937 


21.130031 


18x 


12 


4.295 GB 


4911.546824 


Required data 


did not fit on the GPU 



Based on the results in table U we see that the speed benefit gained by running the main propagation calculations on the GPU 
rather than the CPU become more and more apparent as Afcmax is increased, until Afcmax = 11. For this reason, the performance tests 
reported below are for calculations where Afcmax = 11. 

Figure [T] presents an estimate of the amount of time spent by the program on each line of the part of the code surrounding the lines 
that execute the main propagation calculations (lines 170-171 for the CPU version of the program, and lines 164-165 for the GPU 
version). The calculation of the summation to determine p at the times {tk = ^'^OfciAfc executed on line 180 for the CPU version 
and 169 for the GPU version. 



Figure 1. Top: Profiler output when the main propagation calculations are performed on the CPU (top) and GPU (bottom). 



11.95 12 15 6 Stirf.es j.perr(.arierjt=Stlir.e3lpe3:rf.arient . ''I_ir.li[ [indices [ : rend-k) -1) "M2+iridices [ : ,end) ,li+l) ; 

11.74 12 157 Ktin".esIperrf.inentErid=Stiir.esIperir.arLerLtEnd. ''I_rf.k:End [ [indices end-k] -1 ) "■M2 + indices [ : , end) fV 

12 158 end 

15 J 

< 0.01 1 160 indices=[]; %clear [' indices ■ ) %when not using parfor 

1 167 Ktiir.esIperK.anent=3:eshepe [Zttiir.esIperK.anent ,M2 , [ ] )■ ; 

1 168 Ktiir.esIperK.anentEnd=reshape [ivtime3lpernkanerit£iidrM2, [] ) ; 

1 169 tic; 

1 170 for J=deltaXrcax+l : f inalPoint 

523.75 9-39 171 Aend=reshape [bsxfun l@times, reshape [sum (resliape (A, r],M2},2),l, [] )■ ,KtiinesIpermanentEiid) , 

524.25 939 172 A=resliape (bsxfim ( @times , reshape (.sum [ reshape [A, [] ,]J2) , 2> , 1, [] j jjCclmesIperTriJ'tiPTit-) ^ 11*1) ."^ %couli 

0.02 9-B9 179 Aend=Te3hdce f Aend. M2 ^ [ M ; 

130.06 95 9 180 rho ( 1 , J+1 ) =s-jir. [ Aend ( 1 , : ) , 2 ) ; 
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tic: 



for J=deltal^ii.ax+1 : f inalPcint 

Aen(i=reshape [tsxf 'an (Stirr.eSf reshape [ s'sn. [resnape [A, [ ] , M2 ) , 2) , 1, [ ] ) ,KtiniesIpei:inanentEnd} r [ ] , 1 } ; %se 

A=re3nape [bsjtf 'an[@tir.eSf leshape [ soir. [reshaps [A, [ ] ,M2] , 2 ) , 1, \ ]] , Ktiir.eslperir.anent ) , %CQuld be 

Aer.d=re3r.ape [Aer.d^M2 f ) ; 
rho(l, J-l}=siim(A2nd(l, : ) ,2) ; 



The most striking observation from figure [T] is that while on the CPU version, the propagation calculations (lines 170-171) are the 
obvious bottleireck, when these calculations are done on the GPU (lines 165-166 of the GPU version), the profiler estimates that these 
calculations take less than a second in total (despite being run 989 times each) ! This means that the new bottleneck for the GPU 
version is the summation on line 169, that caculates p at 989 successive points in time, and according to the profiler's estimate takes 
42.3 seconds in total (0.0428 seconds for each time value at which p is calculated). 

Very often, one is only interested in p{t) after a certain amount of time tspecific has passed. For example, in the study of laser-driven 
quntum dots, one is often interested in the population of excitons after a certain amount of time (given by (l|p(ispocific)|l) = /OiLtspocitic 
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Figure 2. CPU time and GPU time vs tspcciflc when p is only calculated at the one time Iq = tspcciflc- 




to [picoseconds] 



in the model studied in this paper), and how this quantity changes with respect to a physical parameter. The experimental data in 
figure 1 of [f7] and in figures f and 2 of (T6| report the photocurrent at a specific time ispccific (the photocurrent is related to Pii,ts,p„citic)i 
as a function of the 'pulse area' (which is related to n{t) in our hamiltonian given by equation 12 ft of the laser that drives a quantum 
dot. Various figures in theoretical and computatonal studies of this same system (see for example [i6l. [tI. [TsI. Ii4. .2ft, .21] . and many more) 
present Pii.tgp^^itic ^ function of the pulse area. In all of these cases, one is not interested in p(t) at every point in time, but only in 
p(ispocific), so line 169 of the GPU version (or line 180 of the CPU vesion) of the code in figure [T] would only have to be executed once 
instead of 989 times, for each pulse area for which one desires to know p(tspocific)- 

For this reason, the program described in this paper allows the user to specify the input variable allPointsOr JustFinalPoint by 
either the string ' allPoints ' , which runs line 180 of the CPU version, or line 169 of the GPU version, for all values of {t = k-At}^^^/, ; 

or by the string ' justFinalPoint ' , which runs those lines only once, at k = N = . All calculations for which the performance 

was reported in figure [2] were run with allPointsOr JustFinalPoint = 'justFinalPoint' . Figure [2] not only demonstrates that 
the benefit of running the main propagation steps on the GPU rather than the CPU grow with the amount of time for which the system 
is being simulated, but also demonstrates very clearly the linear scaling of the markovian tensor propagator algorithm of Makri and 
Makarov. 

DISCUSSION 

It is worth noting that in the program described in this paper, based on the profiler output in the bottom panel of figure [1] the main 
propagation calcualtions and each summation used to determine the numerical values of p at a particular time t, take a very small 
(almost negligible) amount of time to compute. This means that other parts of the program, which set up the numerical Feynman 
integral calculation (such as the calculation of the 77 coefficients presented in equations 1101 to I16|) can play an equally important, or 
dominant role to the overall runtime of the program (especially if not implemented efficiently). In figure 1 of [2], it was shown that 
the calculation of ?7fefe' , k + k' (equation [10] of this paper), for the same spectral distribution function used in the calculations presented 
in this paper (given by equations [22 | [23l and [24 ]) . took more than 300 seconds to calculate on the CPU used when using Mathematica 
8's NIntegrate function to numerically integrate the most frequently used expressions for these quantities, while it took less than half 
a second to compute on the same CPU using Mathematica 8's Sum function to evaluate the expression introduced in [2] for the same 
quantities. For tliis reason, using the analytic expressions for the rj coefficients first given in \2\ will help to maximize the benefit from 
the GPU version of the program described in this paper - and if one were to numerically compute the historically more traditional 
expressions for these coefficients, the time taken to execute this numerical integration would in many cases dominate the execution 
time for the rest of the program's calculations. 

It is also important to discuss the computational overhead associated with transfering data from the CPU to the GPU and back, 
especially the GPU deals with a very large[f| amount of data for the calculations performed by the program described in this paper. 
The transferring of the relevant data from the CPU to the GPU is done by lines 160-163 of the GPU version of the code, and the 
transferring of the result {p{tk)} stored in the array rho. is transferred back from the GPU to the CPU on line 171 of that code. The 
result of the profiler's estimates if runtime displayed in the bottom panel of figure [T] suggest that the transferring of the data to and 
from the GPU is performed in real time. The profiler is perhaps underestimating the runtimes of at least one of the lines of code 



^ When the accuracy of the calculations is important, the size of the data involved in the PMC will ideally be as large as the CPU's memory will allow. 
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performed on the GPU (between lines 160 and 172). This can be seen by observing that figure [5] shows that lines 160 to 172 of the 
GPU version of the code can take up to one minute, and the sum of all the runtimes estimated by the profiler for these lines never 
reached one minute when these calculations were redone with the profiler o If it is true that the profiler has underestimated the 
runtimes of lines 160-163 in figure [1] which transfer the relevant data from the CPU to the GPU, this might at least partially explain 
why line 165 was reported to have taken significantly longer than line 166 in the bottom panel of figure [1] despite the top panel of the 
same figure showing that in the analogous lines for the CPU version of the code (lines 171 and 172), line 172 is actually faster than 171 
! In any case though, the transfer of the data to and from the GPU cannot be significant to the overall runtime of the program, since 
the exact same data is transfered in every instance represented by a red circle in figure [2l and therefore the overall runtime associated 
with transferring the data should be within the scale of the first red circle. The linear increase seen in the red curve is due to the fact 
that lines 165 and 166 are repeated more and more times in the for loop as N is increased. 

The program described in this paper is open source, and users are encouraged to contribute on Github by implementing techniques 
to improve its efficiency, such as the OFPF technique (l8| and the interpolation method f}^ mentioned in the introduction, and the 
extension to allow for models where the OQS can couple to more than one QHO of a given frequency using the more general influence 
functional first presented in |15|. 
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